You can earn 10 to 20 points.
Every project must include a brief description of the dataset.
State which metric scores you have decided to use.
Try at least 2 different models.
A mandatory part of every project is a summary at the end in which you summarize the most interesting insights obtained.
The result is a Jupyter Notebook with descriptions included, or a PDF report plus source code.
The deadline is 10. 4. 2022.
import sys
import logging
def add_logger(path='log.txt'):
    nblog = open(path, "a+")
    sys.stdout.echo = nblog
    sys.stderr.echo = nblog
    get_ipython().log.handlers[0].stream = nblog
    get_ipython().log.setLevel(logging.INFO)
add_logger()
import matplotlib.pyplot as plt # plotting
import matplotlib.image as mpimg # images
import numpy as np #numpy
import tensorflow.compat.v2 as tf # use TensorFlow v2 as the main API
import tensorflow.keras as keras # required for high level applications
from sklearn.model_selection import train_test_split # split for validation sets
from sklearn.preprocessing import normalize # normalization of the matrix
from scipy.signal import convolve2d # convolution of 2D signals
import os
import plotly.express as px
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import seaborn as sns
import pandas as pd
import time
import unicodedata, re, string
import nltk
The dataset consists of records containing 6 columns:
import os
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
#path_to_dataset = "/content/drive/MyDrive/School/VSB/NLP_datasets/sentiment140"
path_to_dataset = "./sentiment140"
FULL_FILE_NAME = "training.1600000.processed.noemoticon.csv"
path = f"{path_to_dataset}/{FULL_FILE_NAME}"
full_data = pd.read_csv(path, encoding='latin-1', header=None)
full_data.columns = ['label', 'id', 'date', '-', 'user', 'text']
full_data.head()
| | label | id | date | - | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
We can see that the dataset is balanced: it contains 800,000 positive and 800,000 negative tweets. For our purposes we will shrink the dataset so that we do not have to train the models for so long.
full_data.label.value_counts()
0    800000
4    800000
Name: label, dtype: int64
Select 30,000 tweets from each class, together with only the required columns.
NORM_VALUE = 30000
def undersample(dataframe, normalization_value=NORM_VALUE):
    # use the parameter, not the global NORM_VALUE, so the argument actually takes effect
    positive = dataframe[dataframe.label == 4].sample(normalization_value)
    negative = dataframe[dataframe.label == 0].sample(normalization_value)
    merged = pd.concat([positive, negative])
    merged = merged.loc[:, ['label', 'text']]
    return merged
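A toy, pandas-free illustration of the per-class undersampling above: draw the same number of records from each sentiment class (labels 0 and 4, as in sentiment140). The record contents here are made up for the sketch.

```python
import random

random.seed(13)
# 100 toy records per class: (label, text)
records = [(0, f"neg {i}") for i in range(100)] + [(4, f"pos {i}") for i in range(100)]
n = 10
# sample n records from each class, then merge into a balanced set
negative = random.sample([r for r in records if r[0] == 0], n)
positive = random.sample([r for r in records if r[0] == 4], n)
balanced = negative + positive
```

The pandas version above does exactly this, with `DataFrame.sample` playing the role of `random.sample`.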
sampled_data_for_project = undersample(full_data)
sampled_data_for_project.head()
| | label | text |
|---|---|---|
| 1318077 | 4 | @SallysChateau Painful thoughts now about Agas... |
| 1231559 | 4 | I'm so glad my golf game is bad because of my ... |
| 902295 | 4 | @lifeincyan Aren't randoms what it's about?!!!... |
| 1228105 | 4 | @tim621 Too much partying after the big Red Wi... |
| 802801 | 4 | @Babybandit my sister is like really good so i... |
We selected the required number of records from each class.
sampled_data_for_project.label.value_counts()
4    30000
0    30000
Name: label, dtype: int64
The positive class, encoded by the numeric value 4, will be transformed to 1. This makes the binary classification more obvious.
sampled_data_for_project.label = list(map(lambda x: 0 if x == 0 else 1, sampled_data_for_project.label.values))
sampled_data_for_project.label.value_counts()
1    30000
0    30000
Name: label, dtype: int64
We create static files for each set (train, test, valid) and save them to disk. With this step we give up the option of k-fold validation, but for this project a comparison between the individual architectures is enough. Since we do not plan to do k-fold cross-validation, we can afford it.
TEST_SIZE = 0.2
VALID_SIZE = 0.1
RANDOM_STATE = 13
SEP = ';'
TRAIN_PATH = os.path.sep.join([path_to_dataset, 'train.csv'])
TEST_PATH = os.path.sep.join([path_to_dataset, 'test.csv'])
VALID_PATH = os.path.sep.join([path_to_dataset, 'valid.csv'])
X_train, X_test, y_train, y_test = train_test_split(sampled_data_for_project.text, sampled_data_for_project.label, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=sampled_data_for_project.label)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=VALID_SIZE, random_state=RANDOM_STATE, stratify=y_train)
def create_dataframe(X, y):
    df = pd.DataFrame()
    df['text'] = X
    df['label'] = y
    return df
train = create_dataframe(X_train, y_train)
test = create_dataframe(X_test, y_test)
valid = create_dataframe(X_valid, y_valid)
We save the created sets to disk.
train.to_csv(TRAIN_PATH, sep=SEP)
test.to_csv(TEST_PATH, sep=SEP)
valid.to_csv(VALID_PATH, sep=SEP)
train = pd.read_csv(TRAIN_PATH, sep=SEP)
test = pd.read_csv(TEST_PATH, sep=SEP)
valid = pd.read_csv(VALID_PATH, sep=SEP)
print(train.shape)
print(test.shape)
print(valid.shape)
(43200, 3)
(12000, 3)
(4800, 3)
## Libraries that need to be installed for the project
!pip install gensim
Requirement already satisfied: gensim in /usr/local/lib/python3.7/dist-packages (3.6.0)
Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.7/dist-packages (from gensim) (1.21.5)
Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from gensim) (1.15.0)
Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.7/dist-packages (from gensim) (5.2.1)
Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.7/dist-packages (from gensim) (1.4.1)
import gensim.downloader as api
import gzip
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()
A method that, given an existing embedding dictionary, builds an embedding matrix for our vocabulary.
def prepare_embeddings_matrix(input_dic, embedding_dimension, vocab):
    num_tokens = len(vocab) + 2
    hits = 0
    misses = 0
    embedding_matrix = np.zeros((num_tokens, embedding_dimension))
    for i, word in enumerate(vocab):
        embedding_vector = None
        if word in input_dic:
            embedding_vector = input_dic[word]
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
            hits += 1
        else:
            misses += 1
    print("Converted %d words (%d misses)" % (hits, misses))
    return embedding_matrix, num_tokens, hits, misses
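A self-contained toy run of the same hit/miss lookup, with a made-up two-word embedding dictionary; the `+ 2` rows mirror the padding/OOV slots reserved above.

```python
import numpy as np

# made-up 2-dimensional embeddings for two words
toy_embeddings = {"good": np.array([0.1, 0.2]), "bad": np.array([0.3, 0.4])}
vocab = ["good", "bad", "unseen"]
dim = 2
matrix = np.zeros((len(vocab) + 2, dim))  # +2 rows reserved for padding/OOV
hits = misses = 0
for i, word in enumerate(vocab):
    vec = toy_embeddings.get(word)
    if vec is not None:
        matrix[i] = vec  # copy the pretrained vector into the matrix row
        hits += 1
    else:
        misses += 1  # row stays all-zero for out-of-dictionary words
```

Here "unseen" is a miss, so its row stays zero, exactly as in the function above.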
A helper method that loads the downloaded model with the embedding vectors.
def load_model(path):
    with gzip.open(path, 'r') as f:
        model = {}
        for line in f:
            split_line = line.split()
            word = split_line[0].decode("utf-8")
            word_embedding = np.array([float(value) for value in split_line[1:]])
            model[word] = word_embedding
    print(f"{len(model)} words loaded!")
    return model
A method that loads the model and processes it at the same time.
def prepare_embedding_matrix_withload(vocab, model_name):
    loaded_model_path = api.load(model_name, return_path=True)
    embedding_dictionary = load_model(loaded_model_path)
    embedding_size = embedding_dictionary['king'].shape[0]
    embedding_matrix, num_tokens, hits, misses = prepare_embeddings_matrix(embedding_dictionary, embedding_size, vocab)
    return embedding_matrix, embedding_size, num_tokens, hits, misses
Models that will be used for transfer learning.
NAME_OF_MODEL_FASTTEXT = "fasttext-wiki-news-subwords-300"
NAME_OF_MODEL_GLOVE = "glove-twitter-200"
def get_fasttext(vocab):
    return prepare_embedding_matrix_withload(vocab, NAME_OF_MODEL_FASTTEXT)

def get_glove_twitter(vocab):
    return prepare_embedding_matrix_withload(vocab, NAME_OF_MODEL_GLOVE)
Training set
plt.figure(figsize=(20, 5))
sns.set_theme(style="darkgrid")
sns.countplot(x=train.label)
<AxesSubplot:xlabel='label', ylabel='count'>
Validation set
plt.figure(figsize=(20, 5))
sns.set_theme(style="darkgrid")
sns.countplot(x=valid.label)
<AxesSubplot:xlabel='label', ylabel='count'>
Test set
plt.figure(figsize=(20, 5))
sns.set_theme(style="darkgrid")
sns.countplot(x=test.label)
<AxesSubplot:xlabel='label', ylabel='count'>
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
from nltk.stem import WordNetLemmatizer, LancasterStemmer  # LancasterStemmer is needed by stem_words below
import time
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package stopwords to /home/usp/pro0255/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /home/usp/pro0255/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
True
We can observe that the text column contains data in raw, unprocessed form. As a rule, text data contain a large amount of noise, so we usually try to transform the text into a more processed form. One example is conversion to lowercase, and similar steps.
preprocessing_time = {}
TEXT_NORM_1 = "TEXT_NORM_1"
TEXT_NORM_2 = "TEXT_NORM_2"
TEXT_RAW = 'text'
TEXT_CLEANED = 'text_cleaned'
TEXT_CLEANED_2 = 'text_cleaned_2'
ALL_TEXTS = [TEXT_RAW, TEXT_CLEANED, TEXT_CLEANED_2]
train.head()
| | Unnamed: 0 | text | label |
|---|---|---|---|
| 0 | 96460 | @colinmaggs whennn are you coming home? I mis... | 0 |
| 1 | 273757 | im so scare to love my boyfriend deeply...it r... | 0 |
| 2 | 1419228 | @prettyyella Really? Thanks! | 1 |
| 3 | 325136 | Does anyone know a CHEAP motorcycle mechanic i... | 0 |
| 4 | 750551 | @justinlabaw Yes I'm gonna hit you in a few! | 0 |
def gensim_normalization(text):
    """Defined method from gensim for processing text"""
    tokens = preprocess_string(text)
    return " ".join(tokens)
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words
def remove_numbers(words):
    """Remove all integer occurrences in list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r"\d+", "", word)
        if new_word != '':
            new_words.append(new_word)
    return new_words
def remove_stopwords(words):
    """Remove stop words from list of tokenized words (note: shadows gensim's remove_stopwords imported above)"""
    stop_set = set(stopwords.words('english'))  # build the set once instead of once per word
    return [word for word in words if word not in stop_set]
def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems
def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas
def fix_nt(words):
    """Merge a standalone "n't"/"nt" token into the preceding word"""
    # the previous version iterated to len(words) - 1 and silently dropped the last token;
    # this version keeps it
    st_res = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and words[i + 1] in ("n't", "nt"):
            st_res.append(words[i] + "n't")
            i += 2
        else:
            st_res.append(words[i])
            i += 1
    return st_res
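The merging behaviour can be checked on a tiny example. The helper below (hypothetical name `merge_nt`) repeats the same logic inline so this cell runs on its own.

```python
def merge_nt(words):
    """Glue a standalone "n't"/"nt" token onto the preceding word."""
    result, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and words[i + 1] in ("n't", "nt"):
            result.append(words[i] + "n't")  # merge the negation into the previous token
            i += 2
        else:
            result.append(words[i])
            i += 1
    return result
```

For example, `["do", "nt", "stop", "now"]` becomes `["don't", "stop", "now"]`, and the trailing word is preserved.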
def remove_whitespaces(text):
    return re.sub(r' +', ' ', text)

user_string = 'user'

def replace_with_user(text):
    return re.sub(r'@\w*', user_string, text)

def remove_user(text):
    return " ".join([word for word in text.split(' ') if word != user_string])
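A quick check of the @-mention handling, repeated inline so the cell stands alone (the raw-string pattern matches the functions above; the sample tweet is made up).

```python
import re

text = "@switchfoot thanks @user2 see you"
# step 1: replace every @-mention with the placeholder token 'user'
with_user = re.sub(r'@\w*', 'user', text)
# step 2: drop the placeholder tokens entirely
without_user = " ".join(w for w in with_user.split(' ') if w != 'user')
```

The two-step replace-then-drop also removes mentions that appear mid-sentence, which a single deletion regex with sloppy whitespace handling can get wrong.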
def normalize(text):
    text_with_user = replace_with_user(text)
    text = remove_user(text_with_user)
    text = remove_whitespaces(text)
    words = text.split(' ')
    #words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_numbers(words)
    words = fix_nt(words)
    words = remove_stopwords(words)
    words = lemmatize_verbs(words)
    return " ".join(words)
Index selected for visualizing a tweet after each preprocessing method.
random_index = 0
test_tweet = train[TEXT_RAW].values[random_index]
test_tweet
'@colinmaggs whennn are you coming home? I miss you and I have nobody to see star trek with!'
normalize(test_tweet)
'whennn come home miss nobody see star trek'
gensim_normalization(test_tweet)
'colinmagg whennn come home miss star trek'
A helper method for applying a normalization method to the text column.
def normalize_method(method, key, train=train, test=test, valid=valid):
    train[key] = train[TEXT_RAW].apply(method)
    test[key] = test[TEXT_RAW].apply(method)
    valid[key] = valid[TEXT_RAW].apply(method)
tic = time.time()
normalize_method(normalize, TEXT_CLEANED)
toc = time.time()
preprocessing_time[TEXT_NORM_1] = toc - tic
tic = time.time()
normalize_method(gensim_normalization, TEXT_CLEANED_2)
toc = time.time()
preprocessing_time[TEXT_NORM_2] = toc - tic
print(train.shape)
print(test.shape)
print(valid.shape)
(43200, 5)
(12000, 5)
(4800, 5)
Comparison of preprocessing runtimes
pre_time_df = pd.DataFrame.from_dict(preprocessing_time, orient="index")
pre_time_df.columns = ['time']
pre_time_df.head()
| | time |
|---|---|
| TEXT_NORM_1 | 104.507218 |
| TEXT_NORM_2 | 5.756902 |
We can see that the gensim preprocessing runs much faster than our own hand-written one.
fig = px.bar(pre_time_df, y='time')
fig.show()
#TEXT_RAW = 'text'
#TEXT_CLEANED = 'text_cleaned'
#TEXT_CLEANED_2 = 'text_cleaned_2'
SUFFIX = '_len'
def add_len(dataframe, key, suffix=SUFFIX):
    # use the passed-in dataframe and suffix (the original always wrote to train)
    dataframe[f'{key}{suffix}'] = dataframe[key].apply(len)
add_len(train, TEXT_RAW)
add_len(train, TEXT_CLEANED)
add_len(train, TEXT_CLEANED_2)
train.head()
| | Unnamed: 0 | text | label | text_cleaned | text_cleaned_2 | text_len | text_cleaned_len | text_cleaned_2_len |
|---|---|---|---|---|---|---|---|---|
| 0 | 96460 | @colinmaggs whennn are you coming home? I mis... | 0 | whennn come home miss nobody see star trek | colinmagg whennn come home miss star trek | 92 | 42 | 41 |
| 1 | 273757 | im so scare to love my boyfriend deeply...it r... | 0 | im scare love boyfriend deeplyit restrict stan... | scare love boyfriend deepli restrict stand leg... | 124 | 87 | 79 |
| 2 | 1419228 | @prettyyella Really? Thanks! | 1 | really | prettyyella thank | 29 | 6 | 17 |
| 3 | 325136 | Does anyone know a CHEAP motorcycle mechanic i... | 0 | anyone know cheap motorcycle mechanic la mean ... | know cheap motorcycl mechan mean recess cheap ... | 135 | 97 | 87 |
| 4 | 750551 | @justinlabaw Yes I'm gonna hit you in a few! | 0 | yes im gonna hit | justinlabaw ye gonna hit | 45 | 16 | 24 |
def XY_len(dataframe, method):
    X = [
        TEXT_RAW,
        TEXT_CLEANED,
        TEXT_CLEANED_2
    ]
    # use the passed-in dataframe (the original always read from train)
    Y = [method(dataframe[f'{x}{SUFFIX}']) for x in X]
    return X, Y
x, y = XY_len(train, np.mean)
x, y
(['text', 'text_cleaned', 'text_cleaned_2'], [73.92969907407408, 36.48787037037037, 38.72310185185185])
x, y = XY_len(train, np.min)
x, y
(['text', 'text_cleaned', 'text_cleaned_2'], [7, 0, 0])
x, y = XY_len(train, np.max)
x, y
(['text', 'text_cleaned', 'text_cleaned_2'], [374, 164, 361])
dist = train.copy()
dist.head()
| | Unnamed: 0 | text | label | text_cleaned | text_cleaned_2 | text_len | text_cleaned_len | text_cleaned_2_len |
|---|---|---|---|---|---|---|---|---|
| 0 | 96460 | @colinmaggs whennn are you coming home? I mis... | 0 | whennn come home miss nobody see star trek | colinmagg whennn come home miss star trek | 92 | 42 | 41 |
| 1 | 273757 | im so scare to love my boyfriend deeply...it r... | 0 | im scare love boyfriend deeplyit restrict stan... | scare love boyfriend deepli restrict stand leg... | 124 | 87 | 79 |
| 2 | 1419228 | @prettyyella Really? Thanks! | 1 | really | prettyyella thank | 29 | 6 | 17 |
| 3 | 325136 | Does anyone know a CHEAP motorcycle mechanic i... | 0 | anyone know cheap motorcycle mechanic la mean ... | know cheap motorcycl mechan mean recess cheap ... | 135 | 97 | 87 |
| 4 | 750551 | @justinlabaw Yes I'm gonna hit you in a few! | 0 | yes im gonna hit | justinlabaw ye gonna hit | 45 | 16 | 24 |
dist = pd.melt(dist, value_vars=['text_len', 'text_cleaned_len', 'text_cleaned_2_len'])
dist.head()
| | variable | value |
|---|---|---|
| 0 | text_len | 92 |
| 1 | text_len | 124 |
| 2 | text_len | 29 |
| 3 | text_len | 135 |
| 4 | text_len | 45 |
We can see that, for all normalizations, most tweets are shorter than 100 characters.
fig = px.histogram(dist, x="value", color="variable")
fig.show()
from tensorflow import string as tf_string
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from enum import Enum
BLANK = '-'
class Fields(Enum):
    ModelName = 'ModelName'
    BatchSize = 'BatchSize'
    Optimizer = 'Optimizer'
    LR = 'LR'
    Epochs = 'Epochs'
    EmbeddingSize = 'EmbeddingSize'
    Time = 'Time'
    Accuracy = 'Accuracy'
    Hits = 'Hits'
    Miss = 'Miss'
    Key = 'Key'
    SeqLen = 'SeqLen'
    VocabSize = 'VocabSize'
    TrainableEmbedding = 'TrainableEmbedding'
    ConfMatrix = "ConfMatrix"
    ModelType = "Type"
def create_value(
    ModelName=BLANK,
    BatchSize=BLANK,
    Optimizer=BLANK,
    Epochs=BLANK,
    EmbeddingSize=BLANK,
    Time=BLANK,
    Accuracy=BLANK,
    LR=BLANK,
    Hits=BLANK,
    Miss=BLANK,
    Key=BLANK,
    SeqLen=BLANK,
    VocabSize=BLANK,
    TrainableEmbedding=BLANK,
    ConfMatrix=BLANK,
    ModelType=BLANK
):
    return {
        Fields.ModelName.value: ModelName,
        Fields.BatchSize.value: BatchSize,
        Fields.Optimizer.value: Optimizer,
        Fields.LR.value: LR,
        Fields.Epochs.value: Epochs,
        Fields.EmbeddingSize.value: EmbeddingSize,
        Fields.Time.value: Time,
        Fields.Accuracy.value: Accuracy,
        Fields.Hits.value: Hits,
        Fields.Miss.value: Miss,
        Fields.Key.value: Key,
        Fields.SeqLen.value: SeqLen,
        Fields.VocabSize.value: VocabSize,
        Fields.TrainableEmbedding.value: TrainableEmbedding,
        Fields.ConfMatrix.value: ConfMatrix,
        Fields.ModelType.value: ModelType
    }
create_value()
{'ModelName': '-',
'BatchSize': '-',
'Optimizer': '-',
'LR': '-',
'Epochs': '-',
'EmbeddingSize': '-',
'Time': '-',
'Accuracy': '-',
'Hits': '-',
'Miss': '-',
'Key': '-',
'SeqLen': '-',
'VocabSize': '-',
'TrainableEmbedding': '-',
'ConfMatrix': '-',
'Type': '-'}
project_results = {}
BATCH_SIZE = 32
BATCH_SIZES = [
    64,
    # 128,
    # 256
]
LR = 0.001
# NOTE: Adam is configured with a hard-coded 1e-5 here, so the LR constant
# recorded in the results table does not match Adam's actual learning rate.
ADAM = tf.keras.optimizers.Adam(learning_rate=0.00001)
RMS = tf.keras.optimizers.RMSprop(learning_rate=LR)
OPTIMIZERS = [
    ADAM,
    RMS
]
EMB_SIZES = [
    50,
    # 100,
    # 150,
    # 200,
    # 250,
    # 300
]
EPOCHS = 10
LOSS = tf.keras.losses.BinaryCrossentropy(from_logits=False)
METRICS = ['accuracy']
PATIENCE = 4
es = keras.callbacks.EarlyStopping(monitor='val_loss', patience=PATIENCE, restore_best_weights=True, mode="auto")
callbacks = [es]
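The patience logic behind `EarlyStopping` can be sketched in plain Python: stop once `val_loss` has not improved for `patience` consecutive epochs and remember the best epoch (standing in for `restore_best_weights`). `early_stop_epoch` is a hypothetical helper for illustration, not a Keras API.

```python
def early_stop_epoch(val_losses, patience=4):
    """Return (epoch training stops at, epoch whose weights are restored), 0-indexed."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0  # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # patience exhausted: stop here, restore the best
    return len(val_losses) - 1, best_epoch  # ran all epochs without triggering
```

With the validation losses of the first run below (0.5885, 0.5437, 0.5686, 0.6003, 0.5705, 1.0302) this stops after the sixth epoch and restores the second, matching the observed training log.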
print(train.shape)
print(test.shape)
print(valid.shape)
(43200, 8)
(12000, 5)
(4800, 5)
def get_train_test_valid_from_key(key, train=train, test=test, valid=valid):
    X_train, y_train = train[key], train['label']
    X_test, y_test = test[key], test['label']
    X_valid, y_valid = valid[key], valid['label']
    print(f"Train size {X_train.shape}")
    print(f"Valid size {X_valid.shape}")
    print(f"Test size {X_test.shape}")
    return X_train, y_train, X_test, y_test, X_valid, y_valid
#TEXT_RAW = 'text'
#TEXT_CLEANED = 'text_cleaned'
#TEXT_CLEANED_2 = 'text_cleaned_2'
X_train, y_train, X_test, y_test, X_valid, y_valid = get_train_test_valid_from_key(TEXT_RAW);
Train size (43200,)
Valid size (4800,)
Test size (12000,)
The architecture used will consist of different types of RNN layers. The reason for using them is their assumed ability to read the input sequence word by word and thereby capture context. The word embedding, which will be learned inside the model, could achieve the best results this way. At the same time, we expect that training will not take too long, since the maximum sequence length we work with is 100 words. Because of this length, the more sophisticated RNN variants LSTM and GRU will also be used, as we try to capture dependencies that are as long as possible.
def run_model(
    embedding_dim,
    vocab_size,
    seq_len,
    key,
    optimizer,
    batch_size,
    epochs,
):
    MODEL_NAME = 'GRU+LSTM_OWN'
    X_train, y_train, X_test, y_test, X_valid, y_valid = get_train_test_valid_from_key(key)
    sequence_length = seq_len
    vect_layer = TextVectorization(max_tokens=vocab_size, output_mode='int', output_sequence_length=sequence_length)
    vect_layer.adapt(X_train)
    voc = vect_layer.get_vocabulary()
    input_layer = keras.layers.Input(shape=(1,), dtype=tf_string)
    x_v = vect_layer(input_layer)
    emb = keras.layers.Embedding(len(voc), embedding_dim, trainable=True)(x_v)
    x = keras.layers.Bidirectional(keras.layers.LSTM(64, activation='relu', return_sequences=True, dropout=0.2, recurrent_dropout=0.2))(emb)
    x = keras.layers.GRU(64, activation='relu', return_sequences=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = keras.layers.Dropout(0.2)(x)
    x = keras.layers.Dense(32, 'relu')(x)
    x = keras.layers.Dropout(0.3)(x)
    x = keras.layers.Dense(64, 'relu')(x)
    output_layer = keras.layers.Dense(1, 'sigmoid')(x)
    model = keras.Model(input_layer, output_layer)
    model.summary()
    model.compile(optimizer=optimizer, loss=LOSS, metrics=METRICS)
    tic = time.time()
    history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), callbacks=callbacks, epochs=epochs, batch_size=batch_size)
    y_pred = model.predict(X_test).ravel()
    y_pred = [1 if x >= 0.5 else 0 for x in y_pred]  # threshold the sigmoid outputs
    accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
    conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
    toc = time.time()
    model_value = create_value(
        ModelName=MODEL_NAME,
        BatchSize=batch_size,
        Optimizer=type(optimizer).__name__,
        Epochs=epochs,
        EmbeddingSize=embedding_dim,
        Time=toc - tic,
        Accuracy=accuracy,
        LR=LR,
        Hits=BLANK,
        Miss=BLANK,
        Key=key,
        SeqLen=seq_len,
        VocabSize=vocab_size,
        TrainableEmbedding=True,
        ConfMatrix=conf_matrix,
        ModelType="NORMAL"
    )
    current = len(list(project_results.keys()))
    print(current + 1)
    project_results[current + 1] = model_value
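The evaluation step at the end of `run_model` can be illustrated with numpy only: threshold the sigmoid outputs at 0.5, then build the 2x2 confusion matrix (rows = true label, columns = predicted label). The probabilities here are hand-made for illustration.

```python
import numpy as np

probs = np.array([0.9, 0.4, 0.6, 0.1])  # sigmoid outputs for four examples
y_true = [1, 0, 0, 0]
y_pred = [1 if p >= 0.5 else 0 for p in probs]  # same thresholding as run_model

# confusion matrix: conf[true_label, predicted_label]
conf = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    conf[t, p] += 1
accuracy = np.trace(conf) / conf.sum()  # correct predictions sit on the diagonal
```

With these values the 0.6 example is a false positive, so the accuracy is 0.75; `sklearn.metrics.accuracy_score` and `confusion_matrix` compute the same quantities on the real predictions.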
Test experiment
#run_model(50, 10000, 100, TEXT_CLEANED, ADAM, 64, 10)
Experiment generator
def generate_default_experiments():
    for embedding_size in EMB_SIZES:
        for vocab_size in [10000]:
            for seq_len in [50, 100]:
                for key in [TEXT_RAW, TEXT_CLEANED, TEXT_CLEANED_2]:
                    for optimizer in [ADAM]:
                        for batch_size in BATCH_SIZES:
                            for epoch in [10]:
                                yield embedding_size, vocab_size, seq_len, key, optimizer, batch_size, epoch
len(list(generate_default_experiments()))
6
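The experiment count is simply the product of the grid sizes. The nested loops above are equivalent to `itertools.product` over the same lists; in this standalone sketch the optimizer object is replaced by a placeholder string so the cell runs without TensorFlow.

```python
import itertools

grid = list(itertools.product(
    [50],                                        # EMB_SIZES
    [10000],                                     # vocabulary sizes
    [50, 100],                                   # sequence lengths
    ['text', 'text_cleaned', 'text_cleaned_2'],  # text columns
    ['ADAM'],                                    # optimizers (placeholder)
    [64],                                        # BATCH_SIZES
    [10],                                        # epochs
))
print(len(grid))  # 1 * 1 * 2 * 3 * 1 * 1 * 1 = 6
```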
for exp_params in generate_default_experiments():
    embedding_dim, vocab_size, seq_len, key, optimizer, batch_size, epochs = exp_params
    run_model(embedding_dim, vocab_size, seq_len, key, optimizer, batch_size, epochs)
Train size (43200,)
Valid size (4800,)
Test size (12000,)
2022-03-20 18:18:16.025668: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2022-03-20 18:18:16.025718: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2022-03-20 18:18:16.027745: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 1)] 0
text_vectorization (TextVec (None, 50) 0
torization)
embedding (Embedding) (None, 50, 50) 500000
bidirectional (Bidirectiona (None, 50, 128) 58880
l)
gru (GRU) (None, 64) 37248
batch_normalization (BatchN (None, 64) 256
ormalization)
dropout (Dropout) (None, 64) 0
dense (Dense) (None, 32) 2080
dropout_1 (Dropout) (None, 32) 0
dense_1 (Dense) (None, 64) 2112
dense_2 (Dense) (None, 1) 65
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 73s 94ms/step - loss: 0.6184 - accuracy: 0.6190 - val_loss: 0.5885 - val_accuracy: 0.7308
Epoch 2/10
675/675 [==============================] - 61s 90ms/step - loss: 0.4548 - accuracy: 0.7911 - val_loss: 0.5437 - val_accuracy: 0.7465
Epoch 3/10
675/675 [==============================] - 62s 91ms/step - loss: 0.3994 - accuracy: 0.8214 - val_loss: 0.5686 - val_accuracy: 0.7465
Epoch 4/10
675/675 [==============================] - 63s 93ms/step - loss: 0.3556 - accuracy: 0.8437 - val_loss: 0.6003 - val_accuracy: 0.7160
Epoch 5/10
675/675 [==============================] - 61s 91ms/step - loss: 0.3124 - accuracy: 0.8622 - val_loss: 0.5705 - val_accuracy: 0.7648
Epoch 6/10
675/675 [==============================] - 61s 91ms/step - loss: 0.2715 - accuracy: 0.8815 - val_loss: 1.0302 - val_accuracy: 0.6890
1
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 1)] 0
text_vectorization_1 (TextV (None, 50) 0
ectorization)
embedding_1 (Embedding) (None, 50, 50) 500000
bidirectional_1 (Bidirectio (None, 50, 128) 58880
nal)
gru_1 (GRU) (None, 64) 37248
batch_normalization_1 (Batc (None, 64) 256
hNormalization)
dropout_2 (Dropout) (None, 64) 0
dense_3 (Dense) (None, 32) 2080
dropout_3 (Dropout) (None, 32) 0
dense_4 (Dense) (None, 64) 2112
dense_5 (Dense) (None, 1) 65
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 76s 99ms/step - loss: 0.5980 - accuracy: 0.6579 - val_loss: 1.2539 - val_accuracy: 0.5323
Epoch 2/10
675/675 [==============================] - 66s 98ms/step - loss: 0.5007 - accuracy: 0.7568 - val_loss: 0.5509 - val_accuracy: 0.7200
Epoch 3/10
675/675 [==============================] - 67s 99ms/step - loss: 0.4504 - accuracy: 0.7852 - val_loss: 0.5867 - val_accuracy: 0.7075
Epoch 4/10
675/675 [==============================] - 66s 98ms/step - loss: 0.4077 - accuracy: 0.8098 - val_loss: 0.6143 - val_accuracy: 0.7077
Epoch 5/10
675/675 [==============================] - 66s 98ms/step - loss: 0.3612 - accuracy: 0.8278 - val_loss: 0.6180 - val_accuracy: 0.7050
Epoch 6/10
675/675 [==============================] - 66s 98ms/step - loss: 0.3294 - accuracy: 0.8431 - val_loss: 1.1247 - val_accuracy: 0.5708
2
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) [(None, 1)] 0
text_vectorization_2 (TextV (None, 50) 0
ectorization)
embedding_2 (Embedding) (None, 50, 50) 500000
bidirectional_2 (Bidirectio (None, 50, 128) 58880
nal)
gru_2 (GRU) (None, 64) 37248
batch_normalization_2 (Batc (None, 64) 256
hNormalization)
dropout_4 (Dropout) (None, 64) 0
dense_6 (Dense) (None, 32) 2080
dropout_5 (Dropout) (None, 32) 0
dense_7 (Dense) (None, 64) 2112
dense_8 (Dense) (None, 1) 65
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 75s 99ms/step - loss: 0.6936 - accuracy: 0.5038 - val_loss: 0.6935 - val_accuracy: 0.5000
Epoch 2/10
675/675 [==============================] - 66s 97ms/step - loss: 0.6932 - accuracy: 0.4975 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 3/10
675/675 [==============================] - 61s 91ms/step - loss: 0.6932 - accuracy: 0.5008 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 4/10
675/675 [==============================] - 61s 90ms/step - loss: 0.6932 - accuracy: 0.4987 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 5/10
675/675 [==============================] - 61s 90ms/step - loss: 0.6932 - accuracy: 0.5007 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 6/10
675/675 [==============================] - 60s 89ms/step - loss: 0.6932 - accuracy: 0.4997 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 7/10
675/675 [==============================] - 61s 91ms/step - loss: 0.6932 - accuracy: 0.4992 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 8/10
675/675 [==============================] - 62s 91ms/step - loss: 0.6932 - accuracy: 0.4956 - val_loss: 0.6932 - val_accuracy: 0.5000
3
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) [(None, 1)] 0
text_vectorization_3 (TextV (None, 100) 0
ectorization)
embedding_3 (Embedding) (None, 100, 50) 500000
bidirectional_3 (Bidirectio (None, 100, 128) 58880
nal)
gru_3 (GRU) (None, 64) 37248
batch_normalization_3 (Batc (None, 64) 256
hNormalization)
dropout_6 (Dropout) (None, 64) 0
dense_9 (Dense) (None, 32) 2080
dropout_7 (Dropout) (None, 32) 0
dense_10 (Dense) (None, 64) 2112
dense_11 (Dense) (None, 1) 65
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 128s 176ms/step - loss: 0.6934 - accuracy: 0.5020 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 2/10
675/675 [==============================] - 129s 192ms/step - loss: 0.6932 - accuracy: 0.4949 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 3/10
675/675 [==============================] - 128s 190ms/step - loss: 0.6932 - accuracy: 0.5023 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 4/10
675/675 [==============================] - 129s 191ms/step - loss: 0.6932 - accuracy: 0.5017 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 5/10
675/675 [==============================] - 129s 191ms/step - loss: 0.6932 - accuracy: 0.4931 - val_loss: 0.6932 - val_accuracy: 0.5000
4
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_5 (InputLayer) [(None, 1)] 0
text_vectorization_4 (TextV (None, 100) 0
ectorization)
embedding_4 (Embedding) (None, 100, 50) 500000
bidirectional_4 (Bidirectio (None, 100, 128) 58880
nal)
gru_4 (GRU) (None, 64) 37248
batch_normalization_4 (Batc (None, 64) 256
hNormalization)
dropout_8 (Dropout) (None, 64) 0
dense_12 (Dense) (None, 32) 2080
dropout_9 (Dropout) (None, 32) 0
dense_13 (Dense) (None, 64) 2112
dense_14 (Dense) (None, 1) 65
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 128s 177ms/step - loss: 0.6935 - accuracy: 0.4988 - val_loss: 0.6934 - val_accuracy: 0.5000
Epoch 2/10
675/675 [==============================] - 116s 172ms/step - loss: 0.6933 - accuracy: 0.4991 - val_loss: 0.6937 - val_accuracy: 0.5000
Epoch 3/10
675/675 [==============================] - 117s 173ms/step - loss: 0.6932 - accuracy: 0.4966 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 4/10
675/675 [==============================] - 115s 171ms/step - loss: 0.6932 - accuracy: 0.4957 - val_loss: 0.6933 - val_accuracy: 0.5000
Epoch 5/10
675/675 [==============================] - 127s 188ms/step - loss: 0.6932 - accuracy: 0.4993 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 6/10
675/675 [==============================] - 129s 191ms/step - loss: 0.6932 - accuracy: 0.4993 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 7/10
675/675 [==============================] - 129s 192ms/step - loss: 0.6932 - accuracy: 0.5026 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 8/10
675/675 [==============================] - 131s 194ms/step - loss: 0.6932 - accuracy: 0.5010 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 9/10
675/675 [==============================] - 123s 182ms/step - loss: 0.6932 - accuracy: 0.4987 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 10/10
675/675 [==============================] - 117s 173ms/step - loss: 0.6932 - accuracy: 0.4978 - val_loss: 0.6932 - val_accuracy: 0.5000
5
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_6 (InputLayer) [(None, 1)] 0
text_vectorization_5 (TextV (None, 100) 0
ectorization)
embedding_5 (Embedding) (None, 100, 50) 500000
bidirectional_5 (Bidirectio (None, 100, 128) 58880
nal)
gru_5 (GRU) (None, 64) 37248
batch_normalization_5 (Batc (None, 64) 256
hNormalization)
dropout_10 (Dropout) (None, 64) 0
dense_15 (Dense) (None, 32) 2080
dropout_11 (Dropout) (None, 32) 0
dense_16 (Dense) (None, 64) 2112
dense_17 (Dense) (None, 1) 65
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 126s 173ms/step - loss: 0.6936 - accuracy: 0.5017 - val_loss: 0.6934 - val_accuracy: 0.5000
Epoch 2/10
675/675 [==============================] - 116s 172ms/step - loss: 0.6933 - accuracy: 0.4936 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 3/10
675/675 [==============================] - 121s 179ms/step - loss: 0.6932 - accuracy: 0.4980 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 4/10
675/675 [==============================] - 123s 181ms/step - loss: 0.6933 - accuracy: 0.5027 - val_loss: 0.6934 - val_accuracy: 0.5000
Epoch 5/10
675/675 [==============================] - 123s 182ms/step - loss: 0.6932 - accuracy: 0.5004 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 6/10
675/675 [==============================] - 146s 216ms/step - loss: 0.6932 - accuracy: 0.5005 - val_loss: 0.6932 - val_accuracy: 0.5000
6
pd.DataFrame.from_dict(project_results, orient="index")
| ModelName | BatchSize | Optimizer | LR | Epochs | EmbeddingSize | Time | Accuracy | Hits | Miss | Key | SeqLen | VocabSize | TrainableEmbedding | ConfMatrix | Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 389.583837 | 0.743167 | - | - | text | 50 | 10000 | True | [[5465, 535], [2547, 3453]] | NORMAL |
| 2 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 416.496985 | 0.725000 | - | - | text_cleaned | 50 | 10000 | True | [[3834, 2166], [1134, 4866]] | NORMAL |
| 3 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 514.631196 | 0.500000 | - | - | text_cleaned_2 | 50 | 10000 | True | [[0, 6000], [0, 6000]] | NORMAL |
| 4 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 657.701281 | 0.500000 | - | - | text | 100 | 10000 | True | [[0, 6000], [0, 6000]] | NORMAL |
| 5 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 1246.870321 | 0.500000 | - | - | text_cleaned | 100 | 10000 | True | [[6000, 0], [6000, 0]] | NORMAL |
| 6 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 769.644757 | 0.500000 | - | - | text_cleaned_2 | 100 | 10000 | True | [[6000, 0], [6000, 0]] | NORMAL |
first = pd.DataFrame.from_dict(project_results, orient='index')
first.to_csv("first.csv", sep=';', index=False)
This section is devoted primarily to experiments with models based on the Transformer architecture.
Two pretrained checkpoints are used, defined below.
More details on Transformers are given in the evaluation at the end.
DistilBertBaseUncased = "distilbert-base-uncased"
BertBaseUncased = "bert-base-uncased"
from transformers import TFAutoModel
from transformers import AutoTokenizer
def tokenize(sentences, tokenizer, max_length, padding='max_length'):
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )
def run_transformer_model(
    transformer_name,
    output_sequence_length,
    key,
    loss,
    optimizer,
    batch_size,
    epochs,
    lr
):
    MODEL_NAME = "Transformer"
    tokenizer = AutoTokenizer.from_pretrained(transformer_name)
    X_train, y_train, X_test, y_test, X_valid, y_valid = get_train_test_valid_from_key(key)
    train_ds = tf.data.Dataset.from_tensor_slices((
        dict(tokenize(list(X_train), tokenizer, output_sequence_length)),
        y_train
    )).batch(batch_size).prefetch(1)
    valid_ds = tf.data.Dataset.from_tensor_slices((
        dict(tokenize(list(X_valid), tokenizer, output_sequence_length)),
        y_valid
    )).batch(batch_size).prefetch(1)
    test_ds = tf.data.Dataset.from_tensor_slices((
        dict(tokenize(list(X_test), tokenizer, output_sequence_length)),
        y_test
    )).batch(1).prefetch(1)
    base = TFAutoModel.from_pretrained(transformer_name)
    input_ids = tf.keras.layers.Input(shape=(output_sequence_length,), dtype=tf.int32, name='input_ids')
    attention_mask = tf.keras.layers.Input((output_sequence_length,), dtype=tf.int32, name='attention_mask')
    # Take the [CLS] token representation as the sentence embedding
    output = base([input_ids, attention_mask]).last_hidden_state[:, 0, :]
    output = tf.keras.layers.Dropout(rate=0.15)(output)
    output = tf.keras.layers.Dense(units=64, activation='relu')(output)
    output = tf.keras.layers.BatchNormalization()(output)
    output = tf.keras.layers.Dense(units=64, activation='relu')(output)
    output = tf.keras.layers.BatchNormalization()(output)
    output_layer = tf.keras.layers.Dense(units=1, activation='sigmoid')(output)
    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output_layer)
    model.summary()
    optimizer = optimizer(learning_rate=lr)
    model.compile(
        loss=loss,
        optimizer=optimizer,
        metrics=METRICS
    )
    tic = time.time()
    history = model.fit(
        train_ds,
        validation_data=valid_ds,
        epochs=epochs,
        callbacks=callbacks
    )
    y_pred = model.predict(test_ds).ravel()
    y_pred = [1 if x >= 0.5 else 0 for x in y_pred]
    accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
    conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
    toc = time.time()
    model_value = create_value(
        ModelName=MODEL_NAME,
        BatchSize=batch_size,
        Optimizer=type(optimizer).__name__,
        Epochs=epochs,
        EmbeddingSize=0,
        Time=toc - tic,
        Accuracy=accuracy,
        LR=lr,
        Hits=0,
        Miss=0,
        Key=key,
        SeqLen=output_sequence_length,  # previously read a stale global seq_len
        VocabSize=tokenizer.vocab_size,  # previously read a stale global vocab_size
        TrainableEmbedding=True,         # the whole pretrained base is fine-tuned
        ConfMatrix=conf_matrix,
        ModelType="TL"
    )
    current = len(project_results)
    print(current)
    project_results[current + 1] = model_value
def generate_transf_experiments():
    for transformer_name in [DistilBertBaseUncased, BertBaseUncased]:
        for seq_len in [100]:
            for key in [TEXT_CLEANED, TEXT_RAW]:
                for batch_size in [64]:
                    for epoch in [2, 5]:
                        for lr in [5e-5]:
                            yield transformer_name, seq_len, key, batch_size, epoch, lr
transformer_experiments = list(generate_transf_experiments())
len(transformer_experiments)
8
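The eight configurations simply multiply out of the grid above: 2 checkpoints × 2 text columns × 2 epoch settings (the other axes have a single value). The same grid can be written more compactly with `itertools.product`; a sketch using placeholder literals standing in for the notebook's constants:

```python
from itertools import product

# Placeholder values standing in for the notebook's constants
# (DistilBertBaseUncased, BertBaseUncased, TEXT_CLEANED, TEXT_RAW).
checkpoints = ["distilbert-base-uncased", "bert-base-uncased"]
keys = ["text_cleaned", "text"]

# One tuple per experiment: (checkpoint, seq_len, key, batch_size, epochs, lr)
grid = list(product(checkpoints, [100], keys, [64], [2, 5], [5e-5]))
print(len(grid))  # 8 experiment configurations
```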
for exp in generate_transf_experiments():
    transformer_name, seq_len, key, batch_size, epoch, lr = exp
    print(exp)
    run_transformer_model(
        transformer_name=transformer_name,
        output_sequence_length=seq_len,
        key=key,
        loss=LOSS,
        optimizer=tf.keras.optimizers.Adam,
        batch_size=batch_size,
        epochs=epoch,
        lr=lr
    )
('distilbert-base-uncased', 100, 'text_cleaned', 64, 2, 5e-05)
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['activation_13', 'vocab_transform', 'vocab_layer_norm', 'vocab_projector'] - This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
Model: "model_22"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_ids (InputLayer) [(None, 100)] 0 []
attention_mask (InputLayer) [(None, 100)] 0 []
tf_distil_bert_model_5 (TFDist TFBaseModelOutput(l 66362880 ['input_ids[0][0]',
ilBertModel) ast_hidden_state=(N 'attention_mask[0][0]']
one, 100, 768),
hidden_states=None
, attentions=None)
tf.__operators__.getitem_5 (Sl (None, 768) 0 ['tf_distil_bert_model_5[0][0]']
icingOpLambda)
dropout_173 (Dropout) (None, 768) 0 ['tf.__operators__.getitem_5[0][0
]']
dense_114 (Dense) (None, 64) 49216 ['dropout_173[0][0]']
batch_normalization_18 (BatchN (None, 64) 256 ['dense_114[0][0]']
ormalization)
dense_115 (Dense) (None, 64) 4160 ['batch_normalization_18[0][0]']
batch_normalization_19 (BatchN (None, 64) 256 ['dense_115[0][0]']
ormalization)
dense_116 (Dense) (None, 1) 65 ['batch_normalization_19[0][0]']
==================================================================================================
Total params: 66,416,833
Trainable params: 66,416,577
Non-trainable params: 256
__________________________________________________________________________________________________
Epoch 1/2
675/675 [==============================] - 3676s 5s/step - loss: 0.5910 - accuracy: 0.6916 - val_loss: 0.5441 - val_accuracy: 0.7467
Epoch 2/2
675/675 [==============================] - 3776s 6s/step - loss: 0.5026 - accuracy: 0.7579 - val_loss: 0.5316 - val_accuracy: 0.7500
6
('distilbert-base-uncased', 100, 'text_cleaned', 64, 5, 5e-05)
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_23"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_ids (InputLayer) [(None, 100)] 0 []
attention_mask (InputLayer) [(None, 100)] 0 []
tf_distil_bert_model_6 (TFDist TFBaseModelOutput(l 66362880 ['input_ids[0][0]',
ilBertModel) ast_hidden_state=(N 'attention_mask[0][0]']
one, 100, 768),
hidden_states=None
, attentions=None)
tf.__operators__.getitem_6 (Sl (None, 768) 0 ['tf_distil_bert_model_6[0][0]']
icingOpLambda)
dropout_193 (Dropout) (None, 768) 0 ['tf.__operators__.getitem_6[0][0
]']
dense_117 (Dense) (None, 64) 49216 ['dropout_193[0][0]']
batch_normalization_20 (BatchN (None, 64) 256 ['dense_117[0][0]']
ormalization)
dense_118 (Dense) (None, 64) 4160 ['batch_normalization_20[0][0]']
batch_normalization_21 (BatchN (None, 64) 256 ['dense_118[0][0]']
ormalization)
dense_119 (Dense) (None, 1) 65 ['batch_normalization_21[0][0]']
==================================================================================================
Total params: 66,416,833
Trainable params: 66,416,577
Non-trainable params: 256
__________________________________________________________________________________________________
Epoch 1/5
675/675 [==============================] - 3735s 6s/step - loss: 0.5864 - accuracy: 0.6921 - val_loss: 0.5297 - val_accuracy: 0.7306
Epoch 2/5
675/675 [==============================] - 3493s 5s/step - loss: 0.4970 - accuracy: 0.7567 - val_loss: 0.5334 - val_accuracy: 0.7415
Epoch 3/5
675/675 [==============================] - 2907s 4s/step - loss: 0.4233 - accuracy: 0.8046 - val_loss: 0.5654 - val_accuracy: 0.7315
Epoch 4/5
675/675 [==============================] - 2925s 4s/step - loss: 0.3237 - accuracy: 0.8586 - val_loss: 0.7244 - val_accuracy: 0.7248
Epoch 5/5
675/675 [==============================] - 2982s 4s/step - loss: 0.2299 - accuracy: 0.9032 - val_loss: 0.9257 - val_accuracy: 0.7194
7
('distilbert-base-uncased', 100, 'text', 64, 2, 5e-05)
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_24"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_ids (InputLayer) [(None, 100)] 0 []
attention_mask (InputLayer) [(None, 100)] 0 []
tf_distil_bert_model_7 (TFDist TFBaseModelOutput(l 66362880 ['input_ids[0][0]',
ilBertModel) ast_hidden_state=(N 'attention_mask[0][0]']
one, 100, 768),
hidden_states=None
, attentions=None)
tf.__operators__.getitem_7 (Sl (None, 768) 0 ['tf_distil_bert_model_7[0][0]']
icingOpLambda)
dropout_213 (Dropout) (None, 768) 0 ['tf.__operators__.getitem_7[0][0
]']
dense_120 (Dense) (None, 64) 49216 ['dropout_213[0][0]']
batch_normalization_22 (BatchN (None, 64) 256 ['dense_120[0][0]']
ormalization)
dense_121 (Dense) (None, 64) 4160 ['batch_normalization_22[0][0]']
batch_normalization_23 (BatchN (None, 64) 256 ['dense_121[0][0]']
ormalization)
dense_122 (Dense) (None, 1) 65 ['batch_normalization_23[0][0]']
==================================================================================================
Total params: 66,416,833
Trainable params: 66,416,577
Non-trainable params: 256
__________________________________________________________________________________________________
Epoch 1/2
675/675 [==============================] - 2849s 4s/step - loss: 0.4599 - accuracy: 0.7866 - val_loss: 0.3946 - val_accuracy: 0.8204
Epoch 2/2
675/675 [==============================] - 2824s 4s/step - loss: 0.3379 - accuracy: 0.8545 - val_loss: 0.4810 - val_accuracy: 0.8213
8
('distilbert-base-uncased', 100, 'text', 64, 5, 5e-05)
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_25"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_ids (InputLayer) [(None, 100)] 0 []
attention_mask (InputLayer) [(None, 100)] 0 []
tf_distil_bert_model_8 (TFDist TFBaseModelOutput(l 66362880 ['input_ids[0][0]',
ilBertModel) ast_hidden_state=(N 'attention_mask[0][0]']
one, 100, 768),
hidden_states=None
, attentions=None)
tf.__operators__.getitem_8 (Sl (None, 768) 0 ['tf_distil_bert_model_8[0][0]']
icingOpLambda)
dropout_233 (Dropout) (None, 768) 0 ['tf.__operators__.getitem_8[0][0
]']
dense_123 (Dense) (None, 64) 49216 ['dropout_233[0][0]']
batch_normalization_24 (BatchN (None, 64) 256 ['dense_123[0][0]']
ormalization)
dense_124 (Dense) (None, 64) 4160 ['batch_normalization_24[0][0]']
batch_normalization_25 (BatchN (None, 64) 256 ['dense_124[0][0]']
ormalization)
dense_125 (Dense) (None, 1) 65 ['batch_normalization_25[0][0]']
==================================================================================================
Total params: 66,416,833
Trainable params: 66,416,577
Non-trainable params: 256
__________________________________________________________________________________________________
Epoch 1/5
675/675 [==============================] - 3005s 4s/step - loss: 0.4586 - accuracy: 0.7878 - val_loss: 0.4173 - val_accuracy: 0.8202
Epoch 2/5
675/675 [==============================] - 2947s 4s/step - loss: 0.3397 - accuracy: 0.8525 - val_loss: 0.4195 - val_accuracy: 0.8215
Epoch 3/5
67/675 [=>............................] - ETA: 44:14 - loss: 0.2793 - accuracy: 0.8846
pd.DataFrame.from_dict(project_results, orient="index")
second = pd.DataFrame.from_dict(project_results, orient='index')
second.to_csv("second.csv", sep=';', index=False)
second
| ModelName | BatchSize | Optimizer | LR | Epochs | EmbeddingSize | Time | Accuracy | Hits | Miss | Key | SeqLen | VocabSize | TrainableEmbedding | ConfMatrix | Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 389.583837 | 0.743167 | - | - | text | 50 | 10000 | True | [[5465, 535], [2547, 3453]] | NORMAL |
| 2 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 416.496985 | 0.725000 | - | - | text_cleaned | 50 | 10000 | True | [[3834, 2166], [1134, 4866]] | NORMAL |
| 3 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 514.631196 | 0.500000 | - | - | text_cleaned_2 | 50 | 10000 | True | [[0, 6000], [0, 6000]] | NORMAL |
| 4 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 657.701281 | 0.500000 | - | - | text | 100 | 10000 | True | [[0, 6000], [0, 6000]] | NORMAL |
| 5 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 1246.870321 | 0.500000 | - | - | text_cleaned | 100 | 10000 | True | [[6000, 0], [6000, 0]] | NORMAL |
| 6 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 769.644757 | 0.500000 | - | - | text_cleaned_2 | 100 | 10000 | True | [[6000, 0], [6000, 0]] | NORMAL |
| 7 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 8299.741099 | 0.740000 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4221, 1779], [1341, 4659]] | TL |
| 8 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 16710.604886 | 0.733250 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4575, 1425], [1776, 4224]] | TL |
| 9 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 6358.123754 | 0.821833 | 0 | 0 | text | 100 | 10000 | True | [[4994, 1006], [1132, 4868]] | TL |
| 10 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 15388.836932 | 0.818750 | 0 | 0 | text | 100 | 10000 | True | [[5191, 809], [1366, 4634]] | TL |
| 11 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 12779.165130 | 0.742333 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4398, 1602], [1490, 4510]] | TL |
| 12 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 30525.127602 | 0.500000 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[6000, 0], [6000, 0]] | TL |
| 13 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 13411.460902 | 0.823917 | 0 | 0 | text | 100 | 10000 | True | [[5346, 654], [1459, 4541]] | TL |
| 14 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 31275.351393 | 0.827750 | 0 | 0 | text | 100 | 10000 | True | [[4958, 1042], [1025, 4975]] | TL |
results_df = pd.DataFrame.from_dict(project_results, orient="index")
results_df.head()
| ModelName | BatchSize | Optimizer | LR | Epochs | EmbeddingSize | Time | Accuracy | Hits | Miss | Key | SeqLen | VocabSize | TrainableEmbedding | ConfMatrix | Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 389.583837 | 0.743167 | - | - | text | 50 | 10000 | True | [[5465, 535], [2547, 3453]] | NORMAL |
| 2 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 416.496985 | 0.725000 | - | - | text_cleaned | 50 | 10000 | True | [[3834, 2166], [1134, 4866]] | NORMAL |
| 3 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 514.631196 | 0.500000 | - | - | text_cleaned_2 | 50 | 10000 | True | [[0, 6000], [0, 6000]] | NORMAL |
| 4 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 657.701281 | 0.500000 | - | - | text | 100 | 10000 | True | [[0, 6000], [0, 6000]] | NORMAL |
| 5 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 1246.870321 | 0.500000 | - | - | text_cleaned | 100 | 10000 | True | [[6000, 0], [6000, 0]] | NORMAL |
Saving the results to disk for possible later evaluation.
path_to_save = os.path.sep.join(['.', "results.csv"])
path_to_save
'./results.csv'
results_df.to_csv(path_to_save, sep=';')
Loading the results back.
results_df = pd.read_csv(path_to_save, sep=';')
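One caveat of this CSV round-trip: `read_csv` brings the `ConfMatrix` column back as plain strings such as `[[5465 535]\n [2547 3453]]`, not as arrays. A minimal sketch of recovering the matrices (the helper name `parse_conf_matrix` is mine, not part of the notebook):

```python
import re

import numpy as np


def parse_conf_matrix(cell: str) -> np.ndarray:
    """Parse a stringified 2x2 confusion matrix back into a NumPy array.

    The CSV stores it as the numpy repr, e.g. two bracketed rows of ints
    separated by a newline.
    """
    values = [int(v) for v in re.findall(r"\d+", cell)]
    return np.array(values).reshape(2, 2)


cm = parse_conf_matrix("[[5465 535]\n [2547 3453]]")
print(cm.sum())  # total number of test samples: 12000
```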
Results of all experiments that were run.
results_df
| Unnamed: 0 | ModelName | BatchSize | Optimizer | LR | Epochs | EmbeddingSize | Time | Accuracy | Hits | Miss | Key | SeqLen | VocabSize | TrainableEmbedding | ConfMatrix | Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 389.583837 | 0.743167 | - | - | text | 50 | 10000 | True | [[5465 535]\n [2547 3453]] | NORMAL |
| 1 | 2 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 416.496985 | 0.725000 | - | - | text_cleaned | 50 | 10000 | True | [[3834 2166]\n [1134 4866]] | NORMAL |
| 2 | 3 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 514.631196 | 0.500000 | - | - | text_cleaned_2 | 50 | 10000 | True | [[ 0 6000]\n [ 0 6000]] | NORMAL |
| 3 | 4 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 657.701281 | 0.500000 | - | - | text | 100 | 10000 | True | [[ 0 6000]\n [ 0 6000]] | NORMAL |
| 4 | 5 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 1246.870321 | 0.500000 | - | - | text_cleaned | 100 | 10000 | True | [[6000 0]\n [6000 0]] | NORMAL |
| 5 | 6 | GRU+LSTM_OWN | 64 | Adam | 0.00100 | 10 | 50 | 769.644757 | 0.500000 | - | - | text_cleaned_2 | 100 | 10000 | True | [[6000 0]\n [6000 0]] | NORMAL |
| 6 | 7 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 8299.741099 | 0.740000 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4221 1779]\n [1341 4659]] | TL |
| 7 | 8 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 16710.604886 | 0.733250 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4575 1425]\n [1776 4224]] | TL |
| 8 | 9 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 6358.123754 | 0.821833 | 0 | 0 | text | 100 | 10000 | True | [[4994 1006]\n [1132 4868]] | TL |
| 9 | 10 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 15388.836932 | 0.818750 | 0 | 0 | text | 100 | 10000 | True | [[5191 809]\n [1366 4634]] | TL |
| 10 | 11 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 12779.165130 | 0.742333 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4398 1602]\n [1490 4510]] | TL |
| 11 | 12 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 30525.127602 | 0.500000 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[6000 0]\n [6000 0]] | TL |
| 12 | 13 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 13411.460902 | 0.823917 | 0 | 0 | text | 100 | 10000 | True | [[5346 654]\n [1459 4541]] | TL |
| 13 | 14 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 31275.351393 | 0.827750 | 0 | 0 | text | 100 | 10000 | True | [[4958 1042]\n [1025 4975]] | TL |
TransformerName = "Transformer"
RnnName = "GRU+LSTM_OWN"
from plotly.subplots import make_subplots
import plotly.graph_objects as go
rnn_own_architecture = results_df[results_df.ModelName == RnnName]
transformer_architecture = results_df[results_df.ModelName == TransformerName]
rnn_own_architecture
| Unnamed: 0 | ModelName | BatchSize | Optimizer | LR | Epochs | EmbeddingSize | Time | Accuracy | Hits | Miss | Key | SeqLen | VocabSize | TrainableEmbedding | ConfMatrix | Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 389.583837 | 0.743167 | - | - | text | 50 | 10000 | True | [[5465 535]\n [2547 3453]] | NORMAL |
| 1 | 2 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 416.496985 | 0.725000 | - | - | text_cleaned | 50 | 10000 | True | [[3834 2166]\n [1134 4866]] | NORMAL |
| 2 | 3 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 514.631196 | 0.500000 | - | - | text_cleaned_2 | 50 | 10000 | True | [[ 0 6000]\n [ 0 6000]] | NORMAL |
| 3 | 4 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 657.701281 | 0.500000 | - | - | text | 100 | 10000 | True | [[ 0 6000]\n [ 0 6000]] | NORMAL |
| 4 | 5 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 1246.870321 | 0.500000 | - | - | text_cleaned | 100 | 10000 | True | [[6000 0]\n [6000 0]] | NORMAL |
| 5 | 6 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 769.644757 | 0.500000 | - | - | text_cleaned_2 | 100 | 10000 | True | [[6000 0]\n [6000 0]] | NORMAL |
fig = px.bar(rnn_own_architecture, x="Key", y="Accuracy", color="Key", barmode="group", facet_row="SeqLen", text='Accuracy')
The plot below shows that most of the experiment runs had trouble learning: the network was unable to predict tweet polarity at all. Only the two experiments with sequence length 50 produced reasonably useful results, around 70 percent accuracy.
It would be worth investigating further why these runs failed and why the network showed no tendency to converge.
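The confusion matrices in the results table make the failure mode explicit: in the failed runs the network predicts a single class for every sample ([[0, 6000], [0, 6000]] or [[6000, 0], [6000, 0]]), which on a balanced test set yields exactly 50 % accuracy. A small sketch of an automated check for such collapsed runs (the helper name is mine):

```python
import numpy as np


def is_collapsed(conf_matrix) -> bool:
    """True when the classifier put every sample into one predicted class,
    i.e. all but one column of the confusion matrix sum to zero."""
    cm = np.asarray(conf_matrix)
    zero_columns = (cm.sum(axis=0) == 0).sum()
    return zero_columns == cm.shape[1] - 1


print(is_collapsed([[0, 6000], [0, 6000]]))      # collapsed run
print(is_collapsed([[5465, 535], [2547, 3453]]))  # healthy run
```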
fig.show()
best_rnn = rnn_own_architecture.sort_values(by="Accuracy", ascending=False).iloc[0, :]
best_rnn
Unnamed: 0                                   1
ModelName                         GRU+LSTM_OWN
BatchSize                                   64
Optimizer                                 Adam
LR                                       0.001
Epochs                                      10
EmbeddingSize                               50
Time                                389.583837
Accuracy                              0.743167
Hits                                         -
Miss                                         -
Key                                       text
SeqLen                                      50
VocabSize                                10000
TrainableEmbedding                        True
ConfMatrix          [[5465 535]\n [2547 3453]]
Type                                    NORMAL
Name: 0, dtype: object
The best result among the recurrent neural networks was about 74 percent accuracy, achieved on the raw text with no preprocessing applied.
best_rnn.Accuracy
0.7431666666666666
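Accuracy alone hides how the 74 % was earned. The precision, recall, and F1 metrics imported at the top can be recovered directly from the stored confusion matrix; a sketch for this best RNN run, assuming rows are true labels [negative, positive] and columns are predictions:

```python
# Best RNN confusion matrix [[5465, 535], [2547, 3453]] unpacked:
# rows = true [negative, positive], columns = predicted.
tn, fp, fn, tp = 5465, 535, 2547, 3453

precision = tp / (tp + fp)  # how many predicted positives were correct
recall = tp / (tp + fn)     # how many true positives were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The split is lopsided: positive predictions are usually right, but the model misses over 40 % of the truly positive tweets, so the headline accuracy overstates its usefulness on the positive class.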
transformer_architecture
| Unnamed: 0 | ModelName | BatchSize | Optimizer | LR | Epochs | EmbeddingSize | Time | Accuracy | Hits | Miss | Key | SeqLen | VocabSize | TrainableEmbedding | ConfMatrix | Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 7 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 8299.741099 | 0.740000 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4221 1779]\n [1341 4659]] | TL |
| 7 | 8 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 16710.604886 | 0.733250 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4575 1425]\n [1776 4224]] | TL |
| 8 | 9 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 6358.123754 | 0.821833 | 0 | 0 | text | 100 | 10000 | True | [[4994 1006]\n [1132 4868]] | TL |
| 9 | 10 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 15388.836932 | 0.818750 | 0 | 0 | text | 100 | 10000 | True | [[5191 809]\n [1366 4634]] | TL |
| 10 | 11 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 12779.165130 | 0.742333 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4398 1602]\n [1490 4510]] | TL |
| 11 | 12 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 30525.127602 | 0.500000 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[6000 0]\n [6000 0]] | TL |
| 12 | 13 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 13411.460902 | 0.823917 | 0 | 0 | text | 100 | 10000 | True | [[5346 654]\n [1459 4541]] | TL |
| 13 | 14 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 31275.351393 | 0.827750 | 0 | 0 | text | 100 | 10000 | True | [[4958 1042]\n [1025 4975]] | TL |
transformer_architecture = transformer_architecture.copy()  # work on a copy, not a slice view
transformer_architecture['Accuracy'] = transformer_architecture['Accuracy'].round(3)
extended_transformer_architecture = transformer_architecture.copy()
extended_transformer_architecture['TT'] = list(map(lambda x: x[0], transformer_experiments))
extended_transformer_architecture.head()
| | Unnamed: 0 | ModelName | BatchSize | Optimizer | LR | Epochs | EmbeddingSize | Time | Accuracy | Hits | Miss | Key | SeqLen | VocabSize | TrainableEmbedding | ConfMatrix | Type | TT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 7 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 8299.741099 | 0.740 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4221 1779]\n [1341 4659]] | TL | distilbert-base-uncased |
| 7 | 8 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 16710.604886 | 0.733 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4575 1425]\n [1776 4224]] | TL | distilbert-base-uncased |
| 8 | 9 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 6358.123754 | 0.822 | 0 | 0 | text | 100 | 10000 | True | [[4994 1006]\n [1132 4868]] | TL | distilbert-base-uncased |
| 9 | 10 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 15388.836932 | 0.819 | 0 | 0 | text | 100 | 10000 | True | [[5191 809]\n [1366 4634]] | TL | distilbert-base-uncased |
| 10 | 11 | Transformer | 64 | Adam | 0.00005 | 2 | 0 | 12779.165130 | 0.742 | 0 | 0 | text_cleaned | 100 | 10000 | True | [[4398 1602]\n [1490 4510]] | TL | bert-base-uncased |
fig = px.bar(
extended_transformer_architecture,
x="Key",
y="Accuracy",
color="Key",
barmode="group",
facet_col="Epochs",
facet_row="TT",
text='Accuracy'
)
The chart below shows that DistilBERT produced results nearly identical to those of the larger BERT model. It also shows that a small number of epochs is enough for the model to predict new records with fairly high accuracy. Preprocessing performed worse than raw text: the runs with no preprocessing applied scored higher.
fig.show()
best_transformer = extended_transformer_architecture.sort_values(by="Accuracy", ascending=False).iloc[0, :]
best_transformer
Unnamed: 0 14 ModelName Transformer BatchSize 64 Optimizer Adam LR 0.00005 Epochs 5 EmbeddingSize 0 Time 31275.351393 Accuracy 0.828 Hits 0 Miss 0 Key text SeqLen 100 VocabSize 10000 TrainableEmbedding True ConfMatrix [[4958 1042]\n [1025 4975]] Type TL TT bert-base-uncased Name: 13, dtype: object
times = {}
rnn_time = np.mean(rnn_own_architecture.Time)
times['rnn'] = rnn_time
selector = (extended_transformer_architecture.Epochs == 2) & (extended_transformer_architecture.TT == 'distilbert-base-uncased')
distil_2_time = np.mean(extended_transformer_architecture[selector].Time)
times['distil_2'] = distil_2_time
selector = (extended_transformer_architecture.Epochs == 5) & (extended_transformer_architecture.TT == 'distilbert-base-uncased')
distil_5_time = np.mean(extended_transformer_architecture[selector].Time)
times['distil_5'] = distil_5_time
selector = (extended_transformer_architecture.Epochs == 2) & (extended_transformer_architecture.TT == 'bert-base-uncased')
bert_2_time = np.mean(extended_transformer_architecture[selector].Time)
times['bert_2'] = bert_2_time
selector = (extended_transformer_architecture.Epochs == 5) & (extended_transformer_architecture.TT == 'bert-base-uncased')
bert_5_time = np.mean(extended_transformer_architecture[selector].Time)
times['bert_5'] = bert_5_time
times_res = pd.DataFrame.from_dict(times, orient="index")
times_res = times_res.reset_index()
times_res.columns = ['name', 'time']
As the chart shows, DistilBERT trained in roughly half the time BERT needed. And, as already noted, a higher number of epochs brings no benefit, since accuracy stays the same. A small number of epochs is enough for these large models; with more, they risk overfitting and forgetting the hard-won understanding of the language.
px.bar(times_res, x='name', y='time', title="Average training time by model type")
best_rnn['TT'] = '-'
best = pd.concat([pd.DataFrame(best_rnn).T, pd.DataFrame(best_transformer).T])
best
| Unnamed: 0 | ModelName | BatchSize | Optimizer | LR | Epochs | EmbeddingSize | Time | Accuracy | Hits | Miss | Key | SeqLen | VocabSize | TrainableEmbedding | ConfMatrix | Type | TT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | GRU+LSTM_OWN | 64 | Adam | 0.001 | 10 | 50 | 389.583837 | 0.743167 | - | - | text | 50 | 10000 | True | [[5465 535]\n [2547 3453]] | NORMAL | - |
| 13 | 14 | Transformer | 64 | Adam | 0.00005 | 5 | 0 | 31275.351393 | 0.828 | 0 | 0 | text | 100 | 10000 | True | [[4958 1042]\n [1025 4975]] | TL | bert-base-uncased |
The following chart shows that transfer learning with a Transformer model achieved roughly 8 percentage points higher accuracy.
px.bar(best, x='ModelName', y='Accuracy')
It should be added that the price of this higher accuracy is an enormous time cost, although Transformer models can be sped up with parallel computation.
px.bar(best, x='ModelName', y='Time')
The project worked with a dataset of 1.6 million tweets. From this huge collection we built the training, test, and validation sets used for learning. These sets were drawn from a sample of 60,000 tweets, which kept training fast enough to run the experiments and evaluate them.
As mentioned above, the dataset was split into three sets:

- Training: 70 %
- Test: 20 %
- Validation: 10 %
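The 70/20/10 split above can be produced with two chained `train_test_split` calls; a minimal sketch on dummy data (the arrays below are illustrative stand-ins, not the project's actual 60,000 tweets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the tweet texts and their 0/1 polarity labels.
X = np.arange(1000).reshape(-1, 1)
y = np.random.randint(0, 2, size=1000)

# Step 1: split off the 20 % test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Step 2: 10 % of the total equals 1/8 of the remaining 80 %,
# so carve the validation set out with test_size=0.125.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.125, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # → 700 100 200
```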
The dataset was balanced, so we were able to use the fairly simple accuracy metric, computed as (number of correct predictions / number of all predictions).
This metric tells us how accurately the model estimates the polarity of a tweet. Model quality was evaluated on 12,000 tweets (20 % of the sample).
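Accuracy can be read straight off a confusion matrix; using the best RNN's matrix reported above:

```python
import numpy as np

# Confusion matrix of the best RNN run (rows: true neg/pos, cols: predicted).
cm = np.array([[5465,  535],
               [2547, 3453]])

# Accuracy = correctly predicted / all = trace / total.
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 6))  # → 0.743167
```

This reproduces the 0.743167 accuracy reported for the best RNN above.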
The project determined the polarity of a given tweet, i.e. whether the tweet is positive or negative. These two states are easily expressed in binary: 0 if the tweet is negative, 1 if it is positive.
The neural networks were trained on these values, with the error computed by the BinaryCrossentropy loss function.
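For reference, binary cross-entropy can be written out directly; a small NumPy sketch of the formula that Keras' `BinaryCrossentropy` implements, -(y·log p + (1-y)·log(1-p)) averaged over the batch:

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of 0/1 labels and
    predicted probabilities, clipped to avoid log(0)."""
    p = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.1, 0.9, 0.8, 0.3])
print(round(binary_crossentropy(y_true, y_pred), 4))  # → 0.1976
```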
Three kinds of preprocessing were tried on the text data:

- Raw (text): no preprocessing applied.
- Preprocessing 1 (text_cleaned): preprocessing written by hand.
- Preprocessing 2 (text_cleaned_2): the gensim method.
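The exact `text_cleaned` pipeline is not reproduced here, but a hypothetical sketch of this kind of hand-written tweet cleaning (lowercasing, stripping URLs, @mentions, and punctuation) could look like:

```python
import re
import string

def clean_tweet(text: str) -> str:
    """Illustrative tweet cleaner (an assumption, not the project's
    actual pipeline): lowercase, drop URLs and @mentions, remove
    punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # @mentions
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("@user I LOVE this!!! http://t.co/xyz"))  # → i love this
```

Exactly this kind of aggressive cleaning can backfire, as discussed below: punctuation such as "!!!" may itself carry sentiment.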
Paradoxically, the results showed that preprocessing did not help much, although this may have been caused by poorly designed preprocessing.
Text must always be treated with caution in this regard, because applying preprocessing can strip information from it that would ultimately be very important for the model.
The reason one would want to apply preprocessing at all is the belief in:
Our own architecture was built from recurrent neural networks. Specifically, we used bidirectional LSTM cells, so that the meaning of a word is captured from both directions, followed by a GRU layer, which usually gives results as good as LSTM but in better time. These layers were followed by a deep dense network combined with tools such as BatchNormalization and Dropout, to optimize the model at least a little and avoid overfitting.
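A minimal Keras sketch of the described stack (the layer sizes and dropout rate are assumptions; only the vocabulary size, sequence length, and embedding size come from the experiment table):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 10_000, 50, 50  # values from the experiments

# Bidirectional LSTM -> GRU -> dense head with BatchNormalization/Dropout.
model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.GRU(32),
    layers.BatchNormalization(),
    layers.Dropout(0.3),                          # assumed rate
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),        # 0 = negative, 1 = positive
])
model.compile(optimizer=keras.optimizers.Adam(1e-3),  # LR from the experiments
              loss=keras.losses.BinaryCrossentropy(),
              metrics=["accuracy"])

# Sanity check: one probability per input sequence.
probs = model.predict(np.zeros((2, SEQ_LEN), dtype="int32"), verbose=0)
print(probs.shape)  # → (2, 1)
```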
At best, this model reached 74 percent accuracy. Its configuration was:
For transfer learning we used a neural network with the Transformer architecture. In recent years these models have achieved the best results, which our project confirmed as well. Two Transformers were tried: distilbert-base-uncased and bert-base-uncased.
In short, both models are trained in the same way, but DistilBERT is a smaller model containing about 40 percent fewer parameters than BERT. Training is then faster while the results are largely preserved.
A Transformer model contains a pretrained representation of the English language, which is then lightly fine-tuned on our problem of determining tweet polarity. Predictions come from a "head" added to the model, a deep dense neural network.
The results really did confirm that these models gave much better results than the RNN, by a full 8 percentage points. It must again be noted, though, that the runtime was many times longer. It was also apparent that the number of epochs for these models need not be high: 2 to 5 epochs are enough, depending on the size of the input dataset. It is important to set a small learning rate so that the model can converge to good results.
The best configuration:
This configuration reached almost 83 percent accuracy; 82.8 percent after rounding.
If enough time is available, using Transformer models pays off: on natural-language-processing problems they reach results that other models usually have no chance of matching.